Detecting possible pairs of materials for composites using a material word co-occurrence network

Composite materials are popular because of their high performance capabilities, but new material development is time-consuming. To accelerate this process, researchers studying material informatics, an academic discipline combining computational science and material science, have developed less time-consuming approaches for predicting possible material combinations. However, these processes remain problematic because some materials are not suited for them. The limitations of specific candidates for new composites may cause potential new material pairs to be overlooked. To solve this problem, we developed a new method to predict possible composite material pairs by considering more materials than previous techniques. We predicted possible material pairs by conducting link predictions of material word co-occurrence networks while assuming that co-occurring material word pairs in scientific papers on composites were reported as composite materials. As a result, we succeeded in predicting the co-occurrence of material words with high specificity. Nodes tended to link to many other words, generating new links in the created co-occurrence material word network; notably, the number of material words co-occurring with graphene increased rapidly. This phenomenon confirmed that graphene is an attractive composite component. We expect our method to contribute to the accelerated development of new composite materials.


Introduction
Interest in composites has increased recently, and the composite materials market is estimated to grow from USD 88.0 billion in 2021 to USD 126.3 billion by 2026 [1]. Composites are materials that consist of two or more constituent materials with considerably different chemical or physical properties. Composite materials are popular because they combine the properties of their constituents. For instance, carbon fiber reinforced polymers (CFRPs), one of the most popular composites, are both strong and light. CFRPs are applied in various fields, such as automotive and aerospace engineering, to reduce costs and energy consumption [2]. Among composites, those that combine materials with completely different properties, such as Prussian blue and cellulose, have attracted particular attention in recent years [3]. Morinobu Endo, one of the pioneers of carbon nanofibers and carbon nanotubes (CNTs), is working on the compounding of innovative material combinations, such as CNTs and polymer materials [4][5][6]. However, developing new composite materials is time-consuming; one study reported that the development of new materials takes over 20 years [7]. One reason is the very large number of potential material options; MatWeb, an online material database, has data on over 170,000 materials [8]. Because so many material combinations are possible, finding new and compositable pairs of materials among them is difficult.
To solve this problem, material informatics (MI), which is an academic field combining material science and computer science, is attracting attention [9]. This academic discipline aims to accelerate the process of designing and finding new materials through the analysis of experimental data. Some studies have reported that data analysis methods, such as neural networks, similarity measurements, and data mining, can be applied to predict the physical characteristics of new composites [10][11][12]. Although MI is a new field, it has made rapid progress, leading to the development of high-performance composites [13].
However, some materials are unsuitable for MI; for example, some polymers have physical property values that are difficult to calculate [14]. The limited set of candidate materials in MI may cause new combinations of materials to be overlooked. Thus, a method that investigates materials from a wider viewpoint to find new pairs is needed.
Beyond MI, bibliometric networks have also been shown to be useful for discovering new materials. One study succeeded in predicting new heat-conductive materials with a network of knowledge extracted from scientific publications [15]. Since new scientific knowledge is generated in existing knowledge networks, it is important to consider a prior knowledge network of material science to predict new materials [16]. Scientific papers contain various information about scientific knowledge relationships, such as citation relationships. In journal papers on composite materials, the authors describe the pairs of materials selected as components for composites; "Graphene-Polyaniline" and "TiO2/Graphene" are examples of material pairs [17,18].
Link prediction in a network is a method to detect possible combinations among many candidates. This technique predicts the existence of a link between two nodes from structural changes in a network. Link prediction is applied in various situations, such as predicting technological spinoffs, in which technologies are used in unanticipated fields of an industry, and identifying promising research collaborators; link prediction on networks of interacting proteins has also been applied to predict protein functions [19][20][21]. There are several link prediction methods, one of which is based on information related to the network structure [22]. This technique uses information that describes the link structure around the nodes, for example, follower/followee relationships on social networks and protein-protein interaction relationships. Examples of network structure indexes are the common neighbor (CN), Jaccard coefficient (JC), resource allocation index (RA), Adamic/Adar index (AA), and preferential attachment (PA) [23][24][25][26][27].
Considering the usefulness of bibliometric networks for discovering new materials and of link prediction techniques for detecting possible pairs, we hypothesize that new composite materials can be predicted by performing link prediction on co-occurrence networks in which the material words described in papers are nodes and their co-occurrence relationships are links. The purpose of this study is to predict compositable pairs of materials from a larger number of candidates than in previous studies, so that materials can be investigated from a broader perspective to discover new combinations. We assume that bibliographic information makes it possible to consider a larger number of materials than physical information because it contains data on a large number of materials; additionally, information on materials described in academic papers can be extracted from databases. More than 150000 papers on composite materials are stored in the Web of Science (WoS) databases; WoS is an online subscription-based scientific citation indexing service maintained by Clarivate Analytics. In this study, we extract the bibliometric information of materials and conduct link prediction on the co-occurrence network of material words to detect new and compositable material pairs.

Method outline
Our method involved the following four steps (Fig 1):
• Extracting papers on composite materials
• Listing material words from collected papers
• Creating a co-occurrence network of material words
• Link prediction of a co-occurrence network of material words
The details of each step are described in the following subsections.

Extracting scientific journal papers describing composite materials
To extract academic papers on composite materials, we obtained bibliographic data related to composite materials from the Science Citation Index and the Social Science Citation Index compiled by the Institute for Scientific Information. We used WoS to access these databases and collect academic articles published over a wide range of years because the WoS databases include journal publication records from a broader span of years than those of other databases. We analyzed a citation network as follows. First, we searched for academic publications from the period before 8/15/2020 using the query "composite material*" (the asterisk indicates a wildcard, so the query also matches word variants such as "composite materials"). Second, we created a citation network of the extracted papers with the obtained citation data. Third, papers not linked to others were excluded because we considered papers without citation links to other studies to be irrelevant to the primary subject (composite materials). In this study, we analyzed the extracted papers included in the maximum connected component of the citation network.
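The component-selection step can be illustrated with a minimal sketch. It assumes the citation data are available as (citing, cited) pairs and ignores link direction; the function name and the toy paper identifiers are hypothetical and are not taken from the study.

```python
from collections import defaultdict, deque

def largest_connected_component(citations):
    """Return the paper ids in the largest connected component.

    `citations` is an iterable of (citing_id, cited_id) pairs. Direction is
    ignored, so two papers are connected if either one cites the other.
    """
    neighbors = defaultdict(set)
    for citing, cited in citations:
        neighbors[citing].add(cited)
        neighbors[cited].add(citing)

    seen, best = set(), set()
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search collects every paper reachable from `start`.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node] - component:
                component.add(nxt)
                queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Toy data (hypothetical): p1-p2-p3 form one component, p4-p5 another.
toy_citations = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
core_papers = largest_connected_component(toy_citations)
```

Papers with no citation links never appear in any pair, so they are excluded automatically, which mirrors the third step described above.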

Listing material words from collected papers
We extracted material words from the collected academic articles and counted the number of scientific papers in which each material word occurred by following an academic landscape system [28]. We selected the 100 most frequently appearing material words and named them "the 100 material words". We only analyzed material words with high frequencies of occurrence because material words with low frequencies were less likely to co-occur with others.

Creating a co-occurrence network of extracted material words
We examined whether two specified material words co-occurred, that is, whether there was a journal paper including both of them. We created a co-occurrence network in which the nodes and links represented the 100 material words and their co-occurrence relationships, respectively.
We created five co-occurrence networks using the extracted papers published by 12/31/2009, 12/31/2010, 12/31/2011, 12/31/2012, and 12/31/2015. Material word pairs that co-occurred for the first time in each period were identified by taking the difference in the material word co-occurrence network between the start and end of each period. For example, for TRP1, a pair that had no link in the network built from the extracted papers published by 12/31/2009 but had a link in the network built from those published by 12/31/2012 was regarded as co-occurring for the first time during TRP1.
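The snapshot-difference idea can be sketched as follows. This is an illustration only: it assumes each paper is reduced to a (year, text) tuple and detects material words by plain substring matching, which is a simplification and not the academic landscape system used in the study. All names and the toy papers are hypothetical.

```python
from itertools import combinations

def cooccurrence_links(papers, words, cutoff_year):
    """Pairs of material words that share a paper published up to cutoff_year."""
    links = set()
    for year, text in papers:
        if year > cutoff_year:
            continue
        lowered = text.lower()
        # Sorting keeps each pair in a canonical (alphabetical) order.
        present = sorted(w for w in words if w in lowered)
        links.update(combinations(present, 2))
    return links

# Toy corpus (hypothetical titles).
papers = [
    (2008, "Graphene and silica fillers"),
    (2011, "Graphene with nylon and silica"),
    (2014, "Nylon and gold composite"),
]
words = ["graphene", "silica", "nylon", "gold"]

start = cooccurrence_links(papers, words, 2009)  # network at the start of the period
end = cooccurrence_links(papers, words, 2012)    # network at the end of the period
new_links = end - start                          # pairs that first co-occurred in between
```

The set difference `end - start` plays the role of the "difference in the material word co-occurrence network" for one period.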
Next, we calculated the scores of the material word pairs with the network structure indexes. We used eight network structure indexes (CN, JC, RA, AA, PA, common neighbors using community information (CNSH), the internal resource allocation index using community information (RASH), and the intercluster measure (WIC)) because network structure indexes were deemed appropriate for the link prediction of small networks [32]. The index of each node pair (x, y) was calculated according to the equations in Table 1 [23][24][25][26][27][33][34][35]. In this study, CNSH, RASH, and WIC were calculated with the material class as community information; the details are described in the results section.
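As an illustration of the five structure-only indexes, the sketch below uses their standard textbook definitions, since the equations of Table 1 are not reproduced in the text. The community-based indexes (CNSH, RASH, WIC) are left out. The adjacency dictionary and node names are hypothetical.

```python
import math

def link_scores(adj, x, y):
    """Score the node pair (x, y); `adj` maps each node to its set of neighbors."""
    common = adj[x] & adj[y]
    union = adj[x] | adj[y]
    return {
        # CN: number of shared neighbors.
        "CN": len(common),
        # JC: shared neighbors relative to all neighbors of either node.
        "JC": len(common) / len(union) if union else 0.0,
        # RA: shared neighbors weighted by the inverse of their degree.
        "RA": sum(1 / len(adj[u]) for u in common),
        # AA: as RA, but with the logarithm of the degree. A common neighbor
        # has degree >= 2, so the logarithm is never zero.
        "AA": sum(1 / math.log(len(adj[u])) for u in common),
        # PA: product of the two degrees.
        "PA": len(adj[x]) * len(adj[y]),
    }

# Toy undirected network with links a-c, a-d, b-c, b-d, b-e, d-e.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}
scores = link_scores(adj, "a", "b")
```

Here nodes a and b share the neighbors c and d, so CN is 2, while RA and AA give the lower-degree neighbor c more weight than d.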
Using the scores of the network indexes, we judged which combinations of training periods and network indexes were the most appropriate for link prediction with the following steps. First, we counted the true positives (TPs), false positives (FPs), true negatives (TNs), and false negatives (FNs) of the link prediction process by setting each score as a cutoff in a training period; from there, we selected the best cutoff for each combination of a training period and a network index (Fig 2). TP and TN were outcomes where the link prediction correctly predicted the positive and negative classes, respectively, whereas FP and FN were outcomes where the link prediction incorrectly predicted the positive and negative classes, respectively. Second, we calculated the sensitivities and specificities from the TPs, FPs, TNs, and FNs with Formulas (1) and (2) [36]. The sensitivities and specificities represent the accuracies of the positive and negative predictions, respectively. In our study, sensitivity and specificity referred to the discovery accuracies of pairs of materials that could and could not be composited, respectively. Third, we plotted the receiver operating characteristic (ROC) curves, displaying (sensitivity, 1-specificity), as shown in Fig 3. Finally, we calculated the average accuracies and areas under the curve (AUCs) and evaluated the accuracies of the link prediction processes in each of the 24 cases (the cases were combinations of eight network structure indexes and three training periods) based on these values. The average accuracy was determined as the average of the sensitivity and specificity (Formula (3)). The AUC represented the area between the horizontal axis and the ROC curve, plotting (sensitivity, 1-specificity), as shown in Fig 3 and Formula (4) [37]. The AUC takes a value from 0% to 100%; the higher the value, the higher the prediction accuracy.
Table 1. Score for node pairs {x, y} under link prediction using each index (where Γ(x) = the number of neighbors of node x in the co-occurrence network, f(u) = 1 if x and y belong to the same community and f(u) = 0 otherwise, and δ = an arbitrary constant (the default value is 0.001)).
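The evaluation steps described above can be sketched under stated assumptions. Formulas (1) to (4) are not reproduced in the text, so the sketch uses the standard definitions: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), average accuracy as their mean, and the AUC by the trapezoidal rule. A pair is treated as a positive prediction when its score is at or above the cutoff. Function names and the toy scores are hypothetical.

```python
def confusion(scores, labels, cutoff):
    """Count TP, FP, TN, FN when pairs with score >= cutoff are predicted positive."""
    tp = fp = tn = fn = 0
    for score, linked in zip(scores, labels):
        if score >= cutoff:
            if linked:
                tp += 1
            else:
                fp += 1
        elif linked:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def rates(tp, fp, tn, fn):
    """Return (sensitivity, specificity)."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity

def average_accuracy(scores, labels, cutoff):
    sensitivity, specificity = rates(*confusion(scores, labels, cutoff))
    return (sensitivity + specificity) / 2

def best_cutoff(scores, labels):
    """Cutoff (taken from the observed scores) with the highest average accuracy."""
    return max(sorted(set(scores)), key=lambda c: average_accuracy(scores, labels, c))

def auc(scores, labels):
    """Area under the ROC curve through the points (1 - specificity, sensitivity)."""
    points = {(0.0, 0.0)}
    for cutoff in set(scores):
        sensitivity, specificity = rates(*confusion(scores, labels, cutoff))
        points.add((1 - specificity, sensitivity))
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Toy data (hypothetical): index scores and whether the pair later co-occurred.
toy_scores = [5, 4, 3, 2, 1, 0]
toy_labels = [True, True, True, False, True, False]
chosen = best_cutoff(toy_scores, toy_labels)
area = auc(toy_scores, toy_labels)
```

In this toy case the cutoff 3 separates the pairs best, and the area is 0.875, which matches the share of (positive, negative) pairs that are ranked in the right order.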

Extracting scientific journal papers on composite materials
A total of 75076 papers containing at least one keyword were collected. We focused on the maximum connected component, which accounted for approximately 59% of the total papers (44430 of 75076 papers).

Listing material words from collected papers
The 100 material words extracted from the collected papers were classified into 4 classes: carbon, ceramic, metal and organic materials (Table 2); the numbers of material words categorized into each class were 7, 41, 10 and 42, respectively. The numbers of scientific papers containing the top 10 and top 100 most frequently occurring material words are shown in Table 3 and S1 Table, respectively. This number differed widely among the various material words; graphene showed the maximum occurrence number at 8736, and calcium hydroxide showed the minimum occurrence number at 36. The top 10 most frequently occurring material words comprised 3 carbon, 2 ceramic, 2 metal, and 3 organic materials. Although carbon materials accounted for a small proportion of the 100 material words, most of them appeared in many scientific papers.

Counting the co-occurrence between two material words
We counted the number of material words that co-occurred with each of the 100 material words as of 2012 and 2015, and we named these material words "co-occurring material words". The numbers of co-occurring material words of the top 10 and top 100 most commonly occurring material words are shown in Table 4 and S2 Table, respectively. This number varied greatly depending on the material word, as the maximum and minimum numbers of co-occurring words in 2015 were 85 for silica and 2 for vanadium phosphate.
We calculated the average percentages of co-occurrence in the material word pairs for each material class and compared these values (Table 5). The percentages of co-occurrence with the 100 material words in each material class for carbon, ceramic, metal, and organic materials were 56.3%, 34.5%, 56.8%, and 39.0%, respectively. The percentage of co-occurrence between the carbon and metal material classes was high (85.7%), whereas that between the ceramic and organic material classes was low (29.6%).
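One way to read the percentage of co-occurrence between two classes is as the share of all possible cross-class word pairs that are linked. That reading is an assumption, because the text does not spell out the denominator. Under it, 85.7% for carbon and metal would correspond to 60 of the 7 × 10 = 70 possible pairs; the count of 60 is inferred here for illustration and is not reported in the text. The function name and toy classes below are hypothetical.

```python
from itertools import product

def class_pair_rate(class_a, class_b, links):
    """Share of word pairs (one word from each class) that are linked.

    `links` is a set of frozensets, each holding the two co-occurring words.
    """
    pairs = [frozenset((a, b)) for a, b in product(class_a, class_b) if a != b]
    linked = sum(1 for pair in pairs if pair in links)
    return linked / len(pairs) if pairs else 0.0

# Toy classes and links (hypothetical).
carbon = ["graphene", "cnt"]
metal = ["gold", "aluminum"]
links = {
    frozenset(("graphene", "gold")),
    frozenset(("graphene", "aluminum")),
    frozenset(("cnt", "gold")),
}
toy_rate = class_pair_rate(carbon, metal, links)  # 3 of the 4 possible pairs
```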

Link prediction of a co-occurrence network
To identify the conditions with high link prediction accuracies, we conducted link prediction for 24 patterns (the patterns were combinations of eight network indexes and three training periods) and compared the results. The AUCs and average accuracies of link prediction in the patterns are shown in S3 Table, and the ROCs are shown in S1-S8 Figs.
By comparing the results for each training period, it was determined that the training period with the highest values depended on the network index. The average AUC values using TRP1, TRP2, and TRP3 were 69.6%, 69.9%, and 69.6%, respectively; these values were nearly the same. However, the training period with the highest AUC differed according to the network index; for example, link prediction using CN and CNSH showed the highest values with TRP2 and TRP1, respectively. The average accuracies using TRP1, TRP2 and TRP3 were 63.9%, 65.1% and 64.8%, respectively; these values were nearly the same. Nevertheless, the training period with the highest average accuracy differed according to the network index; for example, link prediction using CN and CNSH showed the highest values with TRP3 and TRP1, respectively. Since the training period with the highest values differed according to the network index as above, the best training period for the link prediction depended on the network index.
From the above results, the following pairs of network indexes and training periods with higher AUCs and average accuracies were determined to be appropriate for link prediction in this study: {CN, TRP2}, {CN, TRP3}, and {CNSH, TRP1}. To evaluate the usefulness of these pairs, we conducted link prediction during the testing period and calculated the accuracies with these pairs. First, we counted the TPs, FPs, FNs, and TNs of each cutoff with CN and CNSH and calculated the sensitivities, specificities and average accuracies (Tables 6 and 7). Next, we conducted link prediction with the following three cutoffs (defined as positive): CN ≥ 9, CN ≥ 5, and CNSH ≥ 8, which were calculated using {CN, TRP2}, {CN, TRP3}, and {CNSH, TRP1}, respectively; since the average accuracies of link prediction using each pair were 64.7%, 64.1% and 56.9%, respectively, link prediction using {CN, TRP2} showed the highest accuracy. CN ≥ 7 showed the highest average accuracy (64.8%) of the cutoffs using CN; this value was close to the average accuracy using CN ≥ 9 (64.7%). Therefore, {CN, TRP2} showed a high link prediction accuracy during the testing period. In the testing period, link prediction using CN ≥ 9 successfully predicted the co-occurrence of 364 of 666 co-occurring material word pairs, and it successfully predicted the non-co-occurrence of 5036 of 6736 non-co-occurring material word pairs. Link prediction with CN ≥ 9 showed lower sensitivity (54.7%) and higher specificity (74.8%) than that using CN ≥ 7, for which the sensitivity and specificity were 66.7% and 62.9%, respectively. In agreement with the previous results, {CN, TRP2} was regarded as an appropriate pair for link prediction during the testing period.
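The reported rates for the cutoff CN ≥ 9 can be checked directly from the counts given above. The short sketch below recomputes them, assuming the standard definitions of sensitivity and specificity and the average accuracy as their mean; the variable names are illustrative.

```python
# Counts reported for the testing period with the cutoff CN >= 9.
true_positives, positive_pairs = 364, 666
true_negatives, negative_pairs = 5036, 6736

sensitivity = true_positives / positive_pairs   # share of co-occurring pairs found
specificity = true_negatives / negative_pairs   # share of non-co-occurring pairs found
average_accuracy = (sensitivity + specificity) / 2

print(f"sensitivity      = {sensitivity:.1%}")
print(f"specificity      = {specificity:.1%}")
print(f"average accuracy = {average_accuracy:.1%}")
```

The recomputed values round to the 54.7%, 74.8%, and 64.7% stated in the text.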

Listing material words from collected papers
Even though carbon materials account for a small proportion of the 100 material words, most of them appear in many scientific papers. This observation implies that carbon materials are highly useful as components of composites. For example, graphene, which appears in the most papers among the 100 material words, is compounded with many kinds of materials, including organic, ceramic and metal materials, and its wide-ranging applications include battery electrode materials, reinforced plastics, photocatalytic materials, and cell culture substrates [39][40][41][42]. CNTs, which appear in the third most papers out of the 100 material words, are compounded with many kinds of materials, including ceramic, metal and organic materials [43,44]. The applications of CNT composites vary; for example, CNT composites have been applied in reinforced plastics, electrode materials and wearable devices [45]. In addition, CNT composite development is expected to accelerate in the future because of the establishment of mass production methods for CNTs [46]. In contrast, calcium hydroxide, which occurs in the fewest scientific papers of the 100 material words, is compounded mainly with resin. The application range of calcium hydroxide is narrower than those of graphene and CNTs, as calcium hydroxide-based compounds are primarily used for crown restoration [47].
Prof. Bunshi Fugetsu of the University of Tokyo, who is an expert in composite materials, claimed that this result represents the attractiveness of carbon materials. Graphene and CNTs have high reactivity and strength as components of composites. In addition, many composites can only be realized with them because of their unique shapes.
According to the results of our method, the number of co-occurring material words varies widely. This phenomenon occurs partly because the ease of handling differs from material to material. For example, graphene and CNTs are manageable due to their high reactivity and strength, but fullerene is not manageable because of its poor solubility [61]. Fullerene is relatively uncommon as a research theme because research on fullerene is likely to take more time to yield discoveries. However, Prof. Fugetsu states that fullerene is a very interesting material and is worth researching.

Link prediction of a co-occurrence network
The cutoff calculated from {CN, TRP2} shows high accuracy for link prediction in both the training period and the testing period. We discuss the cause of this result below.
First, we examined why CN and CNSH showed high link prediction accuracy in the training periods. We concluded that this occurs because the assumptions behind CN and CNSH match the properties of the created co-occurrence networks of material words. CN and CNSH assume that two nodes sharing common neighbors are more likely to become linked than two nodes without common neighbors. In the created co-occurrence networks of material words, nodes that are linked to many other nodes tend to gain new links (details are described in the next subsection), and nodes with many neighbors tend to have many common neighbors with another node. In other words, link prediction using CN or CNSH is more likely to correctly predict new links, and it shows high accuracies and AUCs. In contrast, RA, AA, and RASH are based on the theory that two nodes are likely to become linked when their common neighbors have few neighbors of their own. Thus, link prediction using these indexes tends to favor node pairs connected through sparsely linked common neighbors, even though nodes with many neighbors are more likely to gain new links in the material word co-occurrence network. RA, AA, and RASH therefore show lower accuracies and AUCs than CN and CNSH. Next, we discuss why link prediction using CNSH shows low average accuracies and AUCs in the testing period. This result indicates that adding community information (in this case, material class) reduces the link prediction accuracy; in other words, the instances of co-occurrence between two material words in different material classes increase. The numbers of material words with which graphene and CNTs co-occurred in the testing period increase rapidly from 43 to 93 and from 76 to 91, respectively. Since only seven of the 100 material words are carbon materials, 93 of the 100 material words are noncarbon materials; in other words, graphene and CNTs co-occur with many material words in other material classes.
Finally, we infer why the optimal training period differs by network index. We assume that this is in part because the instances of co-occurrence between material words in different classes increase rapidly. Link prediction shows high accuracy in TRP1 when using network indexes that add community information (CNSH, RASH, and WIC); however, it tends to exhibit low accuracy in TRP1 when using network indexes that do not add community information (CN, JC, RA, AA, and PA). This result indicates that old data (in this case, TRP1) are optimal when using network indexes that add community information, such as CNSH, because there are few instances of co-occurrence between two material words in different material classes in old data. As above, we conclude that network indexes that do not add community information are appropriate for new data, and the other indexes are appropriate for old data.

Evaluation of the link prediction accuracy
The cutoff calculated from {CN, TRP2} is defined as the best in our method because it shows high accuracy in both the training and testing periods. This cutoff shows high specificity (74.8%) during the testing period and is regarded as a useful cutoff for the following reasons. Which measure matters most when evaluating true/false predictions depends on the case. In the case of cancer screening, sensitivity is more important than specificity because positive cases (i.e., patients who have cancer) must not be overlooked. In contrast, cold assessment emphasizes specificity because it is important to reduce false positives (i.e., patients who do not have a cold but are diagnosed with it) to accelerate examination. We consider specificity to be more important than sensitivity in our method because researchers need to avoid research themes that are unlikely to produce results due to time and budget limitations. Another reason is that an emphasis on sensitivity may lead to a focus on research themes that are only likely to produce results, decelerating innovation. Since innovation is based on diverse knowledge, researchers must broaden their research scope to avoid disregarding innovative discoveries [62]. For these reasons, we consider the cutoff calculated from {CN, TRP2} to be useful, indicating that we have succeeded in obtaining a cutoff that is useful on the testing data.
Next, we analyze the common characteristics of material word pairs for which link prediction with the best cutoff cannot predict co-occurrence. The ratio of FN material pairs for each material word is calculated from the following formula, and we determine the material words whose co-occurrence was overlooked based on the ratio.
$$A = \frac{|B \cap C|}{|B|}$$
A: FN rate of material word pairs for the material word x, B: material word pairs containing x that co-occur during TRP2, C: material word pairs that are not predicted to co-occur.
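One reading of the definitions of A, B, and C above is that A is the share of the pairs in B that also fall in C. The sketch below follows that reading, which is an interpretation and not a formula quoted from the source. Pairs are stored as frozensets, and all names and toy data are hypothetical.

```python
def fn_rate(word, cooccurring_pairs, predicted_pairs):
    """FN rate for `word`: share of its actual co-occurring pairs that were not predicted.

    Both arguments are sets of frozensets holding two material words each.
    Returns None when the word has no co-occurring pair in the period.
    """
    actual = {pair for pair in cooccurring_pairs if word in pair}  # B
    missed = actual - predicted_pairs                              # B ∩ C
    return len(missed) / len(actual) if actual else None

# Toy data (hypothetical): gold's two new pairs were predicted,
# cyclodextrin's single new pair was not.
actual_pairs = {
    frozenset(("gold", "graphene")),
    frozenset(("gold", "nylon")),
    frozenset(("cyclodextrin", "graphene")),
}
predicted = {frozenset(("gold", "graphene")), frozenset(("gold", "nylon"))}

gold_rate = fn_rate("gold", actual_pairs, predicted)
cyclodextrin_rate = fn_rate("cyclodextrin", actual_pairs, predicted)
```

A rate of 1.0 means every new co-occurrence of the word was overlooked, and 0.0 means none was.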
As a result, the FN rate of the following 12 material words is 100%: carboxymethyl cellulose, polydopamine, V2O5, polylactide, C3N4, Li3V2, MoO3, cyclodextrin, melamine, CaCl2, polyetherimide and LiCl. These material words tended to co-occur with only a few material words in 2012; for example, cyclodextrin co-occurred with fewer than 10 material words. On the other hand, the FN rates of the following material words were 0%: gold, PVA, polyurethane, aluminum, PMMA, BaTiO3, nylon, polyimide, B4C and titanium diboride (TiB2). As they co-occur with more than 10 material words, they tend to co-occur with more words than those for which the FN rate is 100%. Therefore, our method is likely to show higher accuracy for predicting co-occurrence with material words that already co-occur with many words.
Because material words that co-occur with many words are more likely to have new instances of co-occurrence, we assume that new co-occurrences of material words that already have many co-occurring words can be predicted by focusing on the words with which they do not yet co-occur. To verify this, we calculate the percentage of co-occurrence from 1/1/2016 to 8/15/2020 for the 100 material words; the values for the top 10 most frequently occurring material words and for all 100 material words are shown in Table 8 and S4 Table, respectively. As the number of co-occurring material words in 2015 and the co-occurrence rate between 2016 and 2020 show a high correlation coefficient of 0.732, we find that in general, the greater the number of co-occurring material words is, the higher the percentage of co-occurrence. While the average for the 100 material words is 14.1%, only graphene and CNTs, which co-occur with more than 80 material words in 2015, show a high co-occurrence rate exceeding 50%. From the above results, new co-occurrences of material words that already have many instances of co-occurrence can very likely be predicted by focusing on the material words with which they do not yet co-occur.
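The correlation analysis can be illustrated with a short sketch. It assumes the coefficient is the Pearson correlation, which the text does not state explicitly, and it uses made-up numbers, not the values of Table 8 or S4 Table.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Toy data (hypothetical): co-occurring words in 2015 and the later
# co-occurrence rate for four material words.
cooccurring_2015 = [5, 20, 40, 85]
later_rate = [0.02, 0.10, 0.15, 0.55]
r = pearson(cooccurring_2015, later_rate)
```

A coefficient close to 1 would indicate, as in the text, that words with more existing co-occurrences tend to gain new ones at a higher rate.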

Conclusions
Innovative material combinations have attracted attention. However, it was difficult for previous methods to detect possible pairs of materials from a wide range of materials because the physical properties of some materials, such as polymers, were hard to quantify. To solve this problem, we predicted possible pairs of materials by conducting link prediction on the co-occurrence network under the assumption that pairs of material words co-occurring in scientific papers on composites are reported as composite materials. Our MI method analyzed various kinds of materials, including polymer materials such as cellulose, and succeeded in searching for compoundable material combinations among thousands of pairs, which was far more than in previous studies [10][11][12]. Our method exhibited the potential to promote the discovery of compoundable and innovative pairs of materials through the cross-sectional exploration of materials.
The limitation of our method was that its sensitivity, and its prediction accuracy for material words that co-occurred with few others, were low. Thus, our future work is to conduct link prediction on only the material words co-occurring with few others to find better conditions. In addition, we plan to try other network indexes and/or combine multiple indexes to improve the sensitivity and specificity.

S1 Fig. ROC curves for CN with each training period. (TIFF)

Table 3. Number of scientific papers in which each of the top 10 most commonly occurring material words appeared. Columns: material word, number of papers in which the word occurred.
https://doi.org/10.1371/journal.pone.0297361.t003